If you get an error loading a library, go to the Packages panel in RStudio, click Install and type the name of the library (you only need to do this once per machine).
library(fpp2)
library(tidyverse)
library(readxl)
library(lmtest) #contains coeftest function
library(TSstudio)
To install the next one you need to download MLTools_0.0.31.tar.gz from Moodle (R libraries section) and install it from that local file.
library(MLTools)
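Equivalently, the installation can be done once from the console. This is only a sketch: the exact .tar.gz file name and its location on your machine are assumptions, so adjust them to the file you actually downloaded.

```r
# One-time installation (sketch). CRAN packages are installed by name;
# MLTools is installed from the local .tar.gz downloaded from Moodle.
# The file name/version below is an assumption; adjust it to your download.
install.packages(c("fpp2", "tidyverse", "readxl", "lmtest", "TSstudio"))
install.packages("MLTools_0.0.31.tar.gz", repos = NULL, type = "source")
```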
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
fdata <- read_excel("ARMA_series.xls")
Visualize the first rows:
head(fdata)
## # A tibble: 6 × 6
## y1 y2 y3 y4 y5 y6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0 0
## 2 0.153 -0.717 -1.71 0 1.67 -0.492
## 3 -0.407 -0.389 0.0653 1.45 -0.257 1.60
## 4 0.0383 0.393 0.509 1.85 -0.0995 -0.525
## 5 -0.0557 -1.40 -0.237 4.16 -1.09 0.320
## 6 0.0899 0.809 -0.630 2.95 3.49 -0.622
In this case we convert the data frame into a time series object with ts; the frequency defaults to 1 if no additional arguments are used.
fdata_ts <- ts(fdata)
and we get some basic information about it:
ts_info(fdata_ts)
## The fdata_ts series is a mts object with 6 variables and 250 observations
## Frequency: 1
## Start time: 1 1
## End time: 250 1
We use the column index to select a time series
y <- fdata_ts[ ,2]
We display the series together with its ACF and PACF; we use these plots to identify significant lags and the model order.
ggtsdisplay(y, lag.max = 20)
Now we propose a model structure by selecting the values of p and q:
p <- 1
q <- 1
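As a cross-check of a manually chosen order, forecast::auto.arima (the forecast package is loaded as part of fpp2) can search over candidate orders by AICc. A minimal sketch on a simulated ARMA(1,1) series; y_demo and fit_auto are illustrative names, and the same call applies to y:

```r
library(forecast)

# Simulate an ARMA(1,1) series roughly like the one under study (illustrative)
set.seed(1)
y_demo <- arima.sim(n = 250, list(ar = -0.02, ma = -0.74))

# d = 0 because the series is already stationary; seasonal = FALSE for these data
fit_auto <- auto.arima(y_demo, d = 0, seasonal = FALSE)
fit_auto
```

The automatic search is a sanity check, not a replacement for inspecting the ACF/PACF yourself.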
We fit the model with the proposed order:
arima.fit <- Arima(y, order=c(p, 0, q), include.mean = FALSE)
And we use summary to display the training errors and
estimated coefficients.
summary(arima.fit)
## Series: y
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## -0.0231 -0.7444
## s.e. 0.0891 0.0643
##
## sigma^2 = 0.8641: log likelihood = -335.89
## AIC=677.79 AICc=677.88 BIC=688.35
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.03653454 0.9258414 0.7483762 297.631 350.5146 0.4578694
## ACF1
## Training set -0.003444551
The next function explores the statistical significance of the estimated coefficients
coeftest(arima.fit)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.023139 0.089100 -0.2597 0.7951
## ma1 -0.744401 0.064328 -11.5720 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We use this information to update the ARMA structure if needed. Calling autoplot on the fitted model produces a root plot to check stationarity and invertibility. Keep in mind that these plots show the inverse roots (that is, \(1/\text{root}\)), so the plotted points should lie inside the unit circle.
autoplot(arima.fit)
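The same check can be done numerically. A self-contained sketch with base R's arima and polyroot (y_demo and fit are illustrative names): stationarity and invertibility require the roots of the AR and MA polynomials to lie outside the unit circle, which is equivalent to the inverse roots shown by autoplot lying inside it.

```r
# Fit an ARMA(1,1) to a simulated series and inspect the polynomial roots
set.seed(1)
y_demo <- arima.sim(n = 250, list(ar = 0.5, ma = -0.7))
fit <- arima(y_demo, order = c(1, 0, 1), include.mean = FALSE)

phi   <- fit$coef["ar1"]
theta <- fit$coef["ma1"]                 # R uses the 1 + theta*B sign convention

ar_roots <- Mod(polyroot(c(1, -phi)))    # moduli of roots of 1 - phi*B
ma_roots <- Mod(polyroot(c(1,  theta)))  # moduli of roots of 1 + theta*B

all(ar_roots > 1)   # TRUE: stationary
all(ma_roots > 1)   # TRUE: invertible
```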
The residual plots can be used to check the hypotheses of the ARMA model:
CheckResiduals.ICAI(arima.fit, bins = 100, lag=20)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1) with zero mean
## Q* = 24.314, df = 18, p-value = 0.145
##
## Model df: 2. Total lags used: 20
Remember: you should also check the p-value of the Ljung-Box test in the output of CheckResiduals.ICAI.
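The same Ljung-Box p-value can be computed directly with base R's Box.test. A minimal self-contained sketch (y_demo and fit are illustrative names); note that fitdf should equal the number of estimated ARMA coefficients, here p + q = 2:

```r
# Fit an ARMA(1,1) to a simulated series and test its residuals for white noise
set.seed(1)
y_demo <- arima.sim(n = 250, list(ar = 0.5, ma = -0.7))
fit <- arima(y_demo, order = c(1, 0, 1), include.mean = FALSE)

# Ljung-Box test; a large p-value is consistent with white-noise residuals
Box.test(residuals(fit), lag = 20, type = "Ljung-Box", fitdf = 2)
```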
An alternative residual plot is obtained with
ggtsdisplay(residuals(arima.fit), lag.max = 20)
Keep in mind that if the residuals are not white noise, you should change the order of the ARMA model.
We begin by reconstructing the series with the model. That is, we obtain the fitted value at \(t\), \(\hat{y}_{t|t-1}\), for all \(t\) in the training data.
The following plot compares the original time series with the reconstructed (fitted) values:
autoplot(y, series = "Real") +
forecast::autolayer(arima.fit$fitted, series = "Fitted")
Now, for a true forecast of future values we use the
horizon parameter h:
y_est <- forecast::forecast(arima.fit, h = 5)
autoplot(y_est)
The series we have been using are synthetic time series, obtained as in the example below:
set.seed(2024)
sim_ts <- arima.sim(n = 250,
list(ar = c(0.8897, -0.4858), ma = c(-0.2279, 0.2488)),
sd = sqrt(0.1796))
See (Hyndman and Athanasopoulos 2021), sections 9.3 and 9.4 for the conditions that the model coefficients should verify.
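Those conditions can be checked numerically for the coefficients used above: all roots of the AR and MA polynomials must lie outside the unit circle. A base-R sketch (recall that arima.sim uses the 1 + theta*B sign convention for the MA part):

```r
ar <- c(0.8897, -0.4858)
ma <- c(-0.2279, 0.2488)

Mod(polyroot(c(1, -ar)))  # roots of 1 - ar1*B - ar2*B^2: both moduli > 1, stationary
Mod(polyroot(c(1,  ma)))  # roots of 1 + ma1*B + ma2*B^2: both moduli > 1, invertible
```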
Let us go over the same dataset but this time we will use functions from other libraries to perform a basic train/test split and get some performance measures.
n <- length(y)
n_test <- floor(0.1 * n)
y_split <- ts_split(y, sample.out = n_test)
y_TS <- y_split$test
y_TR <- y_split$train
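ts_split is just a convenience; for a frequency-1 series the same split can be written with base R's window. A self-contained sketch (y_demo and the _demo names are illustrative), assuming time runs 1..n:

```r
# Manual 90/10 train/test split of a frequency-1 ts
y_demo <- ts(rnorm(250))
n      <- length(y_demo)
n_test <- floor(0.1 * n)

y_TR_demo <- window(y_demo, end   = n - n_test)      # observations 1 .. 225
y_TS_demo <- window(y_demo, start = n - n_test + 1)  # observations 226 .. 250

c(length(y_TR_demo), length(y_TS_demo))  # 225 and 25
```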
Now we will briefly go over the same model identification and fitting steps, but using the training values.
ggtsdisplay(y_TR, lag.max = 20)
p <- 1
q <- 1
arima.fit <- Arima(y_TR, order=c(p, 0, q), include.mean = FALSE)
summary(arima.fit)
## Series: y_TR
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## -0.0267 -0.7460
## s.e. 0.0912 0.0635
##
## sigma^2 = 0.885: log likelihood = -304.94
## AIC=615.87 AICc=615.98 BIC=626.12
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.03280064 0.93654 0.7566988 325.6628 382.3785 0.461487
## ACF1
## Training set -0.00351327
coeftest(arima.fit)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## ar1 -0.026683 0.091205 -0.2926 0.7699
## ma1 -0.746021 0.063459 -11.7560 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
autoplot(arima.fit)
CheckResiduals.ICAI(arima.fit, bins = 100, lag=20)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1) with zero mean
## Q* = 19.891, df = 18, p-value = 0.339
##
## Model df: 2. Total lags used: 20
Once we are satisfied with the model we can use it to get a true forecast covering the same time span as the test set:
y_fc <- forecast::forecast(arima.fit, h = n_test)
Keep in mind that this forecast object is more than just
a set of values: it is a list containing a lot of information about the
model we fitted and the forecast for the selected horizon, as suggested
by this plot:
autoplot(y_fc)
But if we want the forecasted values we can get them like this:
y_fc$mean
## Time Series:
## Start = 226
## End = 250
## Frequency = 1
## [1] 9.200863e-01 -2.455056e-02 6.550796e-04 -1.747941e-05 4.664012e-07
## [6] -1.244493e-08 3.320666e-10 -8.860494e-12 2.364235e-13 -6.308462e-15
## [11] 1.683279e-16 -4.491475e-18 1.198455e-19 -3.197824e-21 8.532716e-23
## [16] -2.276775e-24 6.075092e-26 -1.621010e-27 4.325321e-29 -1.154120e-30
## [21] 3.079526e-32 -8.217064e-34 2.192550e-35 -5.850355e-37 1.561043e-38
Note that as indicated by the plot these values decay to 0 very
quickly. We can also use test_forecast from TSstudio to get
a plot of the actual values, the model’s fitted values (for training or
in-sample), and these forecasted values (for the test set or
out of sample).
test_forecast(actual = y,
forecast.obj = y_fc,
test = y_TS)
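The geometric decay of the forecasted values is no coincidence: for a zero-mean ARMA(1,1), the h-step forecast for \(h \geq 2\) satisfies \(\hat{y}_{T+h} = \phi_1 \hat{y}_{T+h-1}\), so successive forecasts shrink by the factor \(\phi_1\). A base-R sketch verifying this on a simulated series (y_demo, fit and ratios are illustrative names):

```r
set.seed(1)
y_demo <- arima.sim(n = 250, list(ar = 0.4, ma = -0.7))
fit <- arima(y_demo, order = c(1, 0, 1), include.mean = FALSE)

fc <- predict(fit, n.ahead = 5)$pred   # 5-step-ahead point forecasts

# Ratios of successive forecasts: each should equal the AR(1) coefficient
ratios <- as.numeric(fc[-1] / fc[-length(fc)])
max(abs(ratios - fit$coef["ar1"]))     # should be essentially zero
```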
We can get the model’s performance measures as follows:
forecast::accuracy(y_fc, y)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.03280064 0.93654 0.7566988 325.66275 382.37846 0.4614870
## Test set -0.03431537 1.01252 0.8809293 93.91238 97.21708 0.5372513
## ACF1 Theil's U
## Training set -0.00351327 NA
## Test set -0.48783927 0.8948283
You can check that e.g. the RMSE obtained here for the test set agrees with the direct computation. Note that we must use y_fc, whose horizon matches the length of y_TS, and not the earlier 5-step forecast y_est:
sqrt(mean((y_fc$mean - y_TS)^2))
Recall the equation of the ARMA(p, q) process:
\[(1 - \phi_1 B - \phi_2 B^2)y_t = (1 - \theta_1 B - \theta_2 B^2)\epsilon_t\]
As a follow-up to the introduction to stochastic processes in the previous session, we will see how to generate a time series that is a realization of one such process. We use two vectors of coefficients for the AR and MA parts (with \(p\leq 2, q\leq 2\)). See the code lines below and make sure you choose the signs correctly.
ar_cff <- c(0, 0)
ma_cff <- c(1/3, 0)
sd <- sqrt(0.1796)
We will create a Gaussian white noise time series. To do that we draw a sample of \(n\) random values from a normal distribution with mean 0 and the standard deviation sd defined above.
set.seed(2024)
n <- 250
w <- ts(rnorm(n, mean = 0, sd = sd) )
Now we generate a time series \(y_t\) using the ARMA(p, q) process. In order to start the process we assume that the values of \(y_t, w_t\) for \(t < 1\) are 0.
y_sim <- numeric(n)
y_sim[1] <- w[1]
y_sim[2] <- w[2] + ar_cff[1] * y_sim[1] - ma_cff[1] * w[1]
for(i in 3:n){
y_sim[i] <- ar_cff[1] * y_sim[i - 1] + ar_cff[2] * y_sim[i - 2] +
w[i] - ma_cff[1] * w[i - 1] - ma_cff[2] * w[i - 2]
}
y_sim <- ts(y_sim)
Our generated y_sim time series is built more crudely than the one we obtained previously with arima.sim but, as the following plot illustrates, it exhibits very similar dynamics.
ggtsdisplay(y_sim, lag.max = 20)
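The loop above can also be vectorized with base R's stats::filter, a sketch under the same pre-sample-equals-zero assumption: the MA part is a one-sided convolution of the noise, and the AR part a recursive filter (sd_w, x and y_filt are illustrative names).

```r
set.seed(2024)
n      <- 250
sd_w   <- sqrt(0.1796)
ar_cff <- c(0, 0)
ma_cff <- c(1/3, 0)
w <- rnorm(n, mean = 0, sd = sd_w)

# MA part: x_t = w_t - ma1*w_{t-1} - ma2*w_{t-2} (one-sided convolution)
x <- as.numeric(stats::filter(w, c(1, -ma_cff), method = "convolution", sides = 1))
x[1] <- w[1]                        # first two values are NA from the filter;
x[2] <- w[2] - ma_cff[1] * w[1]     # fill them assuming zero pre-sample noise

# AR part: y_t = x_t + ar1*y_{t-1} + ar2*y_{t-2} (recursive filter, zero init)
y_filt <- ts(as.numeric(stats::filter(x, ar_cff, method = "recursive")))
```

With these coefficients the recursive step is the identity (ar_cff is zero), and y_filt reproduces the y_sim series computed by the loop.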